
[DeepSpeed] scale grad for zero-2 #3880

Closed
kashif wants to merge 1 commit into huggingface:main from kashif:deepspeed-grad-acc

Conversation

@kashif
Contributor

@kashif kashif commented Dec 9, 2025

What does this PR do?

This pull request updates the backward method in src/accelerate/accelerator.py to ensure consistent loss scaling across all distributed training backends.

Distributed training consistency:

  • The loss is now always scaled by gradient_accumulation_steps, regardless of the backend, to prevent incorrect accumulation in DeepSpeed ZeRO-2 and similar scenarios. This change makes loss scaling explicit and consistent for all distributed types (see the sketch below).
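
As a point of reference, the following is a minimal, framework-agnostic sketch of the scaling convention this PR makes explicit. The names (train_with_accumulation, grad_accum_steps, and the loop arguments) are illustrative only and not part of Accelerate's API.

def train_with_accumulation(model, optimizer, data_loader, loss_fn, grad_accum_steps):
    # Illustrative training loop, not Accelerate code.
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(data_loader):
        loss = loss_fn(model(inputs), targets)
        # Divide before backward so the gradients accumulated over one window
        # equal the gradient of the mean loss over that window.
        (loss / grad_accum_steps).backward()
        if (step + 1) % grad_accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()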

Fixes #3877

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@kashif kashif changed the title from "scale grad" to "[DeepSpeed] scale grad for zero-2" on Dec 9, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

# Note: DeepSpeed does NOT automatically scale loss during backward for all ZeRO stages,
# particularly ZeRO-2 where gradient partitioning can cause incorrect accumulation
# if the loss is not pre-scaled. This ensures consistent behavior across all ZeRO stages.
loss = loss / self.gradient_accumulation_steps
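
For context, here is a simplified sketch of where such a pre-division would sit in a backward wrapper; this is a hypothetical stand-in written for illustration, not the actual Accelerator.backward source.

class SimpleAccelerator:
    # Hypothetical wrapper, not the real Accelerator class.
    def __init__(self, gradient_accumulation_steps: int = 1):
        self.gradient_accumulation_steps = gradient_accumulation_steps

    def backward(self, loss, **kwargs):
        # The change under review: always pre-divide, for every backend,
        # before handing the loss to the framework's backward pass.
        loss = loss / self.gradient_accumulation_steps
        loss.backward(**kwargs)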
Contributor


Are you sure the issue isn't somewhere else?

You can see this is already done here:
https://github.com/deepspeedai/DeepSpeed/blob/b00b75f05852e0791f1e2b9c1cc894cd690e2da4/deepspeed/runtime/engine.py#L2482

and grads are scaled here:
https://github.com/deepspeedai/DeepSpeed/blob/b00b75f05852e0791f1e2b9c1cc894cd690e2da4/deepspeed/runtime/engine.py#L2360

so I suspect the above change is likely to break things, no?

cc: @tjruwase to double check.
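
To make the concern concrete: if Accelerate pre-divides the loss and the DeepSpeed engine also scales with respect to gradient accumulation steps, the effective gradients shrink by an extra factor of 1/steps. A toy calculation (illustrative arithmetic, not library code):

grad_accum_steps = 8
raw_loss = 2.0

pre_scaled = raw_loss / grad_accum_steps        # division proposed in this PR: 0.25
double_scaled = pre_scaled / grad_accum_steps   # engine-side scaling on top: 0.03125

# The result is the loss divided by grad_accum_steps a second time,
# i.e. an unintended extra factor of 1/grad_accum_steps on every gradient.
assert double_scaled == raw_loss / grad_accum_steps ** 2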

Contributor Author


yes agree! i just wanted to test things... reverting

@kashif
Contributor Author

kashif commented Dec 11, 2025

closing this

@kashif kashif closed this Dec 11, 2025


Development

Successfully merging this pull request may close these issues.

Gradient accumulation gives worse results when using DeepSpeed ZeRO 2
